Combining probability models and web mining models: a framework for proper name transliteration

نویسندگان

  • Yilu Zhou
  • Feng Huang
  • Hsinchun Chen
چکیده

The rapid growth of the Internet has created a tremendous number of multilingual resources. However, language boundaries prevent information sharing and discovery across countries. Proper names play an important role in search queries and knowledge discovery. When foreign names are involved, proper names are often translated phonetically which is referred to as transliteration. In this research we propose a generic transliteration framework, which incorporates an enhanced Hidden Markov Model (HMM) and a Web mining model. We improved the traditional statistical-based transliteration in three areas: (1) incorporated a simple phonetic transliteration knowledge base; (2) incorporated a bigram and a trigram HMM; (3) incorporated a Web mining model that uses word frequency of occurrence information from the Web. We evaluated the framework on an English–Arabic back transliteration. Experiments showed that when using HMM alone, a combination of the bigram and trigram HMM approach performed the best for English–Arabic transliteration. While the bigram model alone achieved fairly good performance, the trigram model alone did not. The Web mining approach boosted the performance by 79.05%. Overall, our framework achieved a precision of 0.72 when the eight best transliterations were considered. Our results show promise for using transliteration techniques to improve multilingual Web retrieval.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Hybrid Model for Extracting Transliteration Equivalents from Parallel Corpora

Several models for transliteration pair acquisition have been proposed to overcome the out-of-vocabulary problem caused by transliterations. To date, however, there has been little literature regarding a framework that can accommodate several models at the same time. Moreover, there is little concern for validating acquired transliteration pairs using up-to-date corpora, such as web documents. ...

متن کامل

Named Entity Translation with Web Mining and Transliteration

This paper presents a novel approach to improve the named entity translation by combining a transliteration approach with web mining, using web information as a source to complement transliteration, and using transliteration information to guide and enhance web mining. A Maximum Entropy model is employed to rank translation candidates by combining pronunciation similarity and bilingual contextu...

متن کامل

Maximum n-Gram HMM-based Name Transliteration: Experiment in NEWS 2009 on English-Chinese Corpus

We propose an English-Chinese name transliteration system using a maximum N-gram Hidden Markov Model. To handle special challenges with alphabet-based and characterbased language pair, we apply a two-phase transliteration model by building two HMM models, one between English and Chinese Pinyin and another between Chinese Pinyin and Chinese characters. Our model improves traditional HMM by assig...

متن کامل

Comparing Two Techniques for Learning Transliteration Models Using a Parallel Corpus

We compare the use of an unsupervised transliteration mining method and a rulebased method to automatically extract lists of transliteration word pairs from a parallel corpus of Hindi/Urdu. We build joint source channel models on the automatically aligned orthographic transliteration units of the automatically extracted lists of transliteration pairs resulting in two transliteration systems. We...

متن کامل

Sideways Transliteration: How to Transliterate Multicultural Person Names?

In a global setting, texts contain transliterated names from many cultural origins. Correct transliteration depends not only on target and source languages but also, on the source language of the name. We introduce a novel methodology for transliteration of names originating in different languages using only monolingual resources. Our method is based on a step of noisy transliteration and then ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Information Technology and Management

دوره 9  شماره 

صفحات  -

تاریخ انتشار 2008